Part 1: Time Series Wrangling & Forecasting

Read in energy data and convert to a tsibble

energy <- read_csv(here("data", "energy.csv"))
  1. Add a new column (date) that is the current month column converted to a time series class, yearmonth
  2. Convert the data frame to a tsibble, with that date column as the time index
energy_ts <- energy %>% 
  mutate(date = tsibble::yearmonth(month)) %>% 
  as_tsibble(key = NULL, index = date)

Exploratory time series visualization

ggplot(data = energy_ts, aes(x = date, y = res_total)) +
  geom_line() +
  labs(y = "Residentail energy consumption \n (Trillion BTU)")

Take-aways from this plot: - Overall increasing trend overall, but stability (and possibly a slight decreasing trend) starting around 2005 - Clear seasonality, with a dominant seasonal feature and also a secondary peak each year - that secondary peak has increased substantially - No notable cyclicality or outliers

Seasonplot:

# use feasts::gg_season() to create an exploratory seasonplot, which has month on the x-axis, energy consumption on the y-axis, and each year is its own series (mapped by line color).

energy_ts %>% 
  gg_season(y = res_total) +
  theme_minimal() +
  labs(x = "month",
       y = "residential energy consumption (trillion BTU)")

Take-aways from this plot: - The highest residential energy usage is around December / January / February - There is a secondary peak around July & August (that’s the repeated secondary peak we see in the original time series graph) - We can also see that the prevalence of that second peak has been increasing over the course of the time series: in 1973 (orange) there was hardly any summer peak. In more recent years (blue/magenta) that peak is much more prominent.

Subseries plot:

energy_ts %>% gg_subseries(res_total)

Take-away here: there is clear seasonality (higher values in winter months), with an increasingly evident second peak in June/July/August. This reinforces our takeaways from the raw data and seasonplots.

Decomposition (here by STL)

STL decomposition allows seasonality to vary over time (a major difference from classical decomposition, and important here since we do see changes in seasonality).

# Find STL decomposition
dcmp <- energy_ts %>% 
  model(STL(res_total ~ season()))

# View the ocmponents: components(dcmp)

# Visualize the decomposed components
components(dcmp) %>%  autoplot() +
  theme_minimal()

Autocorrelaiton function (ACF)

We use the ACF to explore autocorrelation (here, we would expect seasonality to be clear from the ACF):

energy_ts %>% 
  ACF(res_total) %>% 
  autoplot()

We see that observations separated by 12 months are the most highly correlated, reflecting strong seasonality we see in all of our other exploratory visualizations.

Forecasting by Holt_Winters exponential smoothing

To create the model below, we specify the model type (exponential smoothing, ETS), then tell it what type of seasonality it should assume using the season("“) expression, where “N” = non-seasonal (try changing it to this to see how unimpressive the forecast becomes!), “A” = additive, “M” = multiplicative. Here, we’ll say seasonality is multiplicative due to the change in variance over time and also within the secondary summer peak:

# Create the model
energy_fit <- energy_ts %>% 
  model(
    ets = ETS(res_total ~ season("M"))
  )

# Forecast using the model 10 years into the future:
energy_forecast <- energy_fit %>% 
  forecast(h = "10 years")

#Plot just the forecasted values (with 80 & 95% CIs):
energy_forecast %>% 
  autoplot()

# Or plot it added to the original data:
energy_forecast %>% 
  autoplot(energy_ts)

Assessing residuals:

Use broom::augment() to append our original tsibble with what the model predicts the energy usage would be based on the model.

# Append the predicted values (and residuals) to original energy data
energy_predicted <- broom::augment(energy_fit)

Now, plot the actual energy values (res_total), and the predicted values (stored as .fitted) atop them:

ggplot(data = energy_predicted) +
  geom_line(aes(x = date, y = res_total)) +
  geom_line(aes(x = date, y = .fitted), color = "red")

Residuals should be: - uncorrelated - centered at 0 - ideally normally distributed One way we can check the distribution is with a histogram:

ggplot(data = energy_predicted, aes(x = .resid)) +
  geom_histogram()

Other forecasting methods

There are a number of other forecasting methods and models! You can learn more about ETS forecasting, seasonal naive (SNAIVE) and autoregressive integrated moving average (ARIMA) from Hyndman’s book - those are the models that I show below.

# Fit 3 different forecasting models (ETS, ARIMA, SNAIVE):
energy_fit_multi <- energy_ts %>% 
  model(
    ets = ETS(res_total ~ season("M")),
    arima = ARIMA(res_total),
    snaive = SNAIVE(res_total)
  )

# Forecast 3 years into the future (from data end date):
multi_forecast <- energy_fit_multi %>% 
  forecast(h = "3 years")

# Plot the 3 forecasts:
multi_forecast %>% 
  autoplot(energy_ts)

# Or just view the forecasts (note the similarity across models):
multi_forecast %>% 
  autoplot()

Part 2: Spatial Dat Wrangling, Vizualization & a Variogram

California county outlines (polygons)

Read in the data:

ca_counties <- read_sf(here("data", "ca_counties", "CA_Counties_TIGER2016.shp"))

Simplify it by only keeping two attributes: NAME (county name) and ALAND (land area), then renaming those to county_name and land_area

ca_subset <- ca_counties %>% 
  select(NAME, ALAND) %>% 
  rename(county_name = NAME, land_area = ALAND)

Check and set the CRS

Use st_crs() to check the existing CRS for spatial data. We see that this CRS is WGS84 (epsg: 3857).

ca_subset %>%  st_crs()
## Coordinate Reference System:
##   User input: WGS 84 / Pseudo-Mercator 
##   wkt:
## PROJCRS["WGS 84 / Pseudo-Mercator",
##     BASEGEOGCRS["WGS 84",
##         DATUM["World Geodetic System 1984",
##             ELLIPSOID["WGS 84",6378137,298.257223563,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4326]],
##     CONVERSION["Popular Visualisation Pseudo-Mercator",
##         METHOD["Popular Visualisation Pseudo Mercator",
##             ID["EPSG",1024]],
##         PARAMETER["Latitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["False easting",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["easting (X)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["northing (Y)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["World - 85°S to 85°N"],
##         BBOX[-85.06,-180,85.06,180]],
##     ID["EPSG",3857]]

Plot the California counties using geom_sf(). Notice that we can update aesthetics just like we would for a regular ggplot object. Here, we update the color based on land area (and change the color gradient).

ggplot(data = ca_subset) +
  geom_sf(aes(fill = land_area), color = "white", size = 0.1) +
  theme_void() +
  scale_fill_gradientn(colors = c("cyan", "blue", "purple"))

Invasive red sesbania records (spaital points)

Read in the data:

sesbania <- read_sf(here("data","red_sesbania", "ds80.shp"))

# Then check it:
sesbania %>% st_crs()
## Coordinate Reference System:
##   User input: Custom 
##   wkt:
## PROJCRS["Custom",
##     BASEGEOGCRS["NAD83",
##         DATUM["North American Datum 1983",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]],
##             ID["EPSG",6269]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["Degree",0.0174532925199433]]],
##     CONVERSION["unnamed",
##         METHOD["Albers Equal Area",
##             ID["EPSG",9822]],
##         PARAMETER["Latitude of false origin",0,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8821]],
##         PARAMETER["Longitude of false origin",-120,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8822]],
##         PARAMETER["Latitude of 1st standard parallel",34,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8823]],
##         PARAMETER["Latitude of 2nd standard parallel",40.5,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8824]],
##         PARAMETER["Easting at false origin",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8826]],
##         PARAMETER["Northing at false origin",-4000000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8827]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1,
##                 ID["EPSG",9001]]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1,
##                 ID["EPSG",9001]]]]

Notice that this CRS is different from the California counties CRS, so we’ll want to update it to match. Use st_transform() to update the CRS:

sesbania <- st_transform(sesbania, 3857)

# Then check it:
sesbania %>%  st_crs()
## Coordinate Reference System:
##   User input: EPSG:3857 
##   wkt:
## PROJCRS["WGS 84 / Pseudo-Mercator",
##     BASEGEOGCRS["WGS 84",
##         DATUM["World Geodetic System 1984",
##             ELLIPSOID["WGS 84",6378137,298.257223563,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4326]],
##     CONVERSION["Popular Visualisation Pseudo-Mercator",
##         METHOD["Popular Visualisation Pseudo Mercator",
##             ID["EPSG",1024]],
##         PARAMETER["Latitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["False easting",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["easting (X)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["northing (Y)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["World - 85°S to 85°N"],
##         BBOX[-85.06,-180,85.06,180]],
##     ID["EPSG",3857]]

plot the graphs together

ggplot() +
  geom_sf(data = ca_subset) +
  geom_sf(data = sesbania, size = 1, color ="red")

Do some more wrangling

Let’s say we want to find the count of red sesbania observed locations in this dataset by county. How can I go about joining these data so that I can find counts? Don’t worry…st_join() has you covered for spatial joins!

ca_sesbania <- ca_subset %>% 
  st_join(sesbania)

And then we can find counts (note: these are not counts for individual plants, but by record in the dataset) by county:

sesbania_counts <- ca_sesbania %>% 
  count(county_name)

Then we can plot a chloropleth using the number of records for red sesbania as the fill color (instead of what we used previously, land area):

ggplot(data = sesbania_counts) +
  geom_sf(aes(fill = n), color = "white", size = 0.1) +
  scale_fill_gradientn(colors = c("lightgray", "orange","red")) +
  theme_minimal() +
  labs(fill = "number of S. punicea records")

So we see that we can still use our usual wrangling skills! Let’s do a bit more for fun, just to prove that our existing wrangling skills still work with spatial data - the spatial information just sticks to it! Only plot the county with the greatest number of red sesbania records (Solano), and make a map of those locations (yeah there are many ways to do this):

# Subset of sesbania point locations only in Solano County
solano_sesbania <- sesbania %>% 
  filter(COUNTY == "Solano")

# Only keep Solano polygon from California County data
solano <- ca_subset %>% 
  filter(county_name == "Solano")

ggplot() +
  geom_sf(data = solano) +
  geom_sf(data = solano_sesbania)

Making an interactive map with {tmap}

(fill aesthetic by land area) with the red sesbania locations on top

# Set the viewing mode to interactive:
tmap_mode(mode = "view")

# Then make a map (with the polygon fill color updated by variable 'land_area', updating the color palette to "BuGn"), then add another shape layer for the sesbania records (added as dots):

tm_shape(ca_subset) +
  tm_fill("land_area", palette = "BuGn") +
  tm_shape(sesbania) +
  tm_dots()

See: tmap vignettes Chapter 8 in Robin Lovelace’s “Geocomputation in R” END PART 2